Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[rapids] removed spark tests, updated to a more recent rapids release #1219

Merged
merged 96 commits into from
Oct 26, 2024

Conversation

cjac
Copy link
Contributor

@cjac cjac commented Aug 8, 2024

Tested with CUDA=11 and CUDA=12

@cjac cjac requested a review from prince-cs August 8, 2024 16:36
@cjac
Copy link
Contributor Author

cjac commented Aug 8, 2024

I prefer this to #1218

@cjac cjac marked this pull request as draft August 8, 2024 16:37
@cjac
Copy link
Contributor Author

cjac commented Aug 9, 2024

/gcbrun

@cjac cjac changed the title [rapids] tested with 24.06 [rapids] tested with CUDA11+22.08 and CUDA12+24.06 Aug 9, 2024
@cjac
Copy link
Contributor Author

cjac commented Aug 9, 2024

/gcbrun

2 similar comments
@cjac
Copy link
Contributor Author

cjac commented Aug 9, 2024

/gcbrun

@prince-cs
Copy link
Collaborator

/gcbrun

@prince-cs
Copy link
Collaborator

Should we increase the machine type from n1-standard-4 to n1-standard-16

@cjac
Copy link
Contributor Author

cjac commented Aug 9, 2024

cuda11 has been manually tested with all versions.
dataproc 2.0 images all pass the automated tests and can be assumed to work with cuda12 as well
Trying cuda12 on 2.1 and 2.2 now.

@cjac
Copy link
Contributor Author

cjac commented Aug 9, 2024

/gcbrun

1 similar comment
@cjac
Copy link
Contributor Author

cjac commented Aug 9, 2024

/gcbrun

@cjac
Copy link
Contributor Author

cjac commented Aug 9, 2024

tests are failing for

  • 2.1-debian11
  • 2.1-rocky8
  • 2.1-ubuntu20
  • 2.2-debian12
  • 2.2-rocky9
  • 2.2-ubuntu22

@cjac cjac changed the title [rapids] tested with CUDA11+22.08 and CUDA12+24.06 [rapids] tested with CUDA11+22.06 and CUDA12+24.06 Aug 9, 2024
@cjac
Copy link
Contributor Author

cjac commented Aug 9, 2024

/gcbrun

1 similar comment
@cjac
Copy link
Contributor Author

cjac commented Aug 10, 2024

/gcbrun

@cjac
Copy link
Contributor Author

cjac commented Aug 10, 2024

[edit: this was a misconfiguration in the systemd unit]

It looks like the dask infrastructure is out of date and I'll have to target 2023.12 instead.

root@cluster-1718310842-m:~# /opt/conda/miniconda3/envs/dask/bin/python /tmp/init/dask/verify_dask_standalone.py 
/opt/conda/miniconda3/envs/dask/lib/python3.11/site-packages/distributed/client.py:1394: VersionMismatchWarning: Mismatched versions found

+-------------+----------------+----------------+---------+
| Package     | Client         | Scheduler      | Workers |
+-------------+----------------+----------------+---------+
| dask        | 2024.6.2       | 2023.12.1      | None    |
| distributed | 2024.6.2       | 2023.12.1      | None    |
| python      | 3.11.9.final.0 | 3.11.8.final.0 | None    |
| tornado     | 6.4.1          | 6.3.3          | None    |
+-------------+----------------+----------------+---------+

@cjac
Copy link
Contributor Author

cjac commented Aug 10, 2024

I also need to reduce the python abi to 3.10

@cjac cjac changed the title [rapids] tested with CUDA11+22.06 and CUDA12+24.06 [rapids] tested with CUDA11+22.06 and CUDA12+23.12 Aug 10, 2024
@cjac cjac changed the title [rapids] tested with CUDA11+22.06 and CUDA12+23.12 [rapids,dask] tested with CUDA11+22.06 and CUDA12+23.12 Aug 10, 2024
@cjac
Copy link
Contributor Author

cjac commented Aug 10, 2024

/gcbrun

6 similar comments
@cjac
Copy link
Contributor Author

cjac commented Aug 10, 2024

/gcbrun

@cjac
Copy link
Contributor Author

cjac commented Aug 10, 2024

/gcbrun

@cjac
Copy link
Contributor Author

cjac commented Aug 11, 2024

/gcbrun

@cjac
Copy link
Contributor Author

cjac commented Aug 11, 2024

/gcbrun

@cjac
Copy link
Contributor Author

cjac commented Aug 11, 2024

/gcbrun

@cjac
Copy link
Contributor Author

cjac commented Aug 11, 2024

/gcbrun

@cjac cjac force-pushed the rapids-20240806 branch from d987ab1 to db71b01 Compare August 11, 2024 01:41
@cjac
Copy link
Contributor Author

cjac commented Oct 25, 2024

All versions aside from 2.2-debian12 succeeded.

Build ID: 36ac2daa-1c2d-483c-8f0b-b3359fee06b9

Commit: e8a44fe

2.2-debian12 failed because of a conda mirror failure. I've opened conda/infrastructure#1051 to track. It was running a dask install, which it shouldn't have been doing. So I've removed the delta from master of the dask/ directory

@cjac
Copy link
Contributor Author

cjac commented Oct 25, 2024

/gcbrun

@cjac
Copy link
Contributor Author

cjac commented Oct 25, 2024

/gcbrun

@cjac
Copy link
Contributor Author

cjac commented Oct 26, 2024

/gcbrun

@cjac
Copy link
Contributor Author

cjac commented Oct 26, 2024

/gcbrun

@cjac
Copy link
Contributor Author

cjac commented Oct 26, 2024

Ach! Finally!

@jakirkham
Copy link

Hooray! 🥳

Thank you for your persistence here CJ! 🙏 👏

@cjac
Copy link
Contributor Author

cjac commented Oct 26, 2024

Hooray! 🥳

Thank you for your persistence here CJ! 🙏 👏

Thanks for noticing!

eeyore

This reminds me of a tool song. I feel like an excess of optimism is the only thing that keeps me going sometimes ;-)

@cjac
Copy link
Contributor Author

cjac commented Oct 26, 2024

okay, result of trying dask>=2024.8 instead of dask>=2024.5:

+ /opt/conda/miniconda3/bin/mamba create -m -n dask-rapids -y --no-channel-priority -c conda-forge -c nvidia -c rapidsai 'cuda-version>=12,<13' rapids=24.08 'dask>=2024.8' dask-bigquery dask-ml dask-sql cudf numba 'python>=3.11'

Looking for: ["cuda-version[version='>=12,<13']", 'rapids=24.08', "dask[version='>=2024.8']", 'dask-bigquery', 'dask-ml', 'dask-sql', 'cudf', 'numba', "python[version='>=3.11']"]

Encountered problems while solving:
  - package rapids-24.08.00-cuda11_py310_240808_g86654f0_0 requires custreamz 24.08.*, but none of the providers can be installed
  
  
 + /opt/conda/miniconda3/bin/conda create -m -n dask-rapids -y --no-channel-priority -c conda-forge -c nvidia -c rapidsai 'cuda-version>=12,<13' rapids=24.08 'dask>=2024.8' dask-bigquery dask-ml dask-sql cudf numba 'python>=3.11'

Solving environment: ...working... failed

LibMambaUnsatisfiableError: Encountered problems while solving:
  - package rapids-24.08.00-cuda11_py310_240808_g86654f0_0 requires custreamz 24.08.*, but none of the providers can be installed

Could not solve for environment specs
The following packages are incompatible
├─ dask >=2024.8  is requested and can be installed;                                                                                                                                                
└─ rapids 24.08**  is not installable because it requires
   └─ custreamz 24.08.* , which requires
      └─ rapids-dask-dependency 24.08.* , which requires
         └─ dask 2024.7.1 , which conflicts with any installable versions previously reported.

@cjac
Copy link
Contributor Author

cjac commented Oct 26, 2024

I'll use "rapids>=${RAPIDS_VERSION}" instead of "rapids=${RAPIDS_VERSION}" as the rapids_spec ; that way we'll pick up a newer version if it can be solved for. Nothing newer than 24.08 for rapids, but dask picked up 2024.7 rather than 2024.5, so I'll encode that into the current state of the patch.

… dask>=2024.7 ; correctly capturing retval of installer program
@cjac
Copy link
Contributor Author

cjac commented Oct 26, 2024

/gcbrun

@cjac
Copy link
Contributor Author

cjac commented Oct 26, 2024

Two in a row!

@cjac cjac merged commit 90b1bd1 into GoogleCloudDataproc:master Oct 26, 2024
2 checks passed
Copy link
Member

@Deependra-Patel Deependra-Patel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not able to fully understand the PR tbh, there are too many changes clubbed together without reasoning why we changed the behavior, PR description is not helping either. We should have split this up.

@@ -70,6 +70,7 @@ determine_tests_to_run() {
changed_dir="${changed_dir%%/*}/"
# Run all tests if common directories modified
if [[ ${changed_dir} =~ ^(integration_tests|util|cloudbuild)/$ ]]; then
continue # remove this before squash/merge
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL

boot_disk_size="200GB",
timeout_in_minutes=30)
boot_disk_size="50GB",
timeout_in_minutes=60)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is too long timeout, is 30 mins not enough for a cluster to be up?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it takes much time as rapids requires gpu driver as well. So, in order to get rapids up and running we first need to install gpu drivers and that takes up quite some time.

("STANDARD", ["m"], GPU_T4, "standalone"))

@parameterized.parameters(
# If a new version of dask-yarn is released, add this test back in.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these not supported anymore? Also, let's update the init actions README as well.

if [[ -z "${IMAGE_VERSION}" ]] ; then
IMAGE_VERSION="$(jq -r .IMAGE_VERSION env.json)" ; fi ; export IMAGE_VERSION

#declare -a TESTS_TO_RUN=('dask:test_dask' 'rapids:test_rapids')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dead code, let's remove this

if [[ "${ROLE}" == "Master" ]]; then
systemctl restart hadoop-yarn-resourcemanager.service
# Restart NodeManager on Master as well if this is a single-node-cluster.
if systemctl status hadoop-yarn-nodemanager; then
if systemctl list-units | grep hadoop-yarn-nodemanager; then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious: Why was this change needed?

@@ -0,0 +1,17 @@
#
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these files committed by mistake?

local python_ver="3.10"
else
local python_ver="3.9"
function configure_knox_for_dask() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to configure knox now, earlier it was not needed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants